More Polars plan optimizations for TPC-DS#22395
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Summary by CodeRabbit:

Refactors many polars_impl TPC-DS benchmark queries to apply early filtering and column projection, use semi-joins and derived pair sets, and reorder joins/aggregations while preserving final result expressions.

Changes: Polars query optimization pattern
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py`:
- Line 121: The ranking uses Polars' ordinal method which gives unique ranks and
diverges from SQL RANK() semantics; update both occurrences where
pl.col("avg_profit").rank(method="ordinal").alias("rnk") is used (and any
subsequent filters like rnk < 11) to use method="min" instead so tied avg_profit
values receive the same rank with gaps (matching DuckDB/SQL RANK behavior).
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py`:
- Line 177: The aggregation pl.col("ext_sales_price").sum().alias("sales_amt")
returns 0 for all-null groups in Polars, which differs from SQL/DuckDB; replace
this with a conditional aggregation that checks
pl.col("ext_sales_price").count() > 0 and only returns the sum when count>0,
otherwise returns None, mirroring the null-sum handling used in q1/q49 so that
the "sales_amt" column matches SQL semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4b334632-2668-4f86-af10-2ff2b1dba11b
📒 Files selected for processing (19)
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
e94077b to e1bb3be
🧹 Nitpick comments (2)
python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py (1)

121-121: ⚡ Quick win: Deduplicate the semi-join probe keys.

Same pattern here: sr_customer_item is acting as an existence set for two semi joins, so keeping duplicate (sr_customer_sk, sr_item_sk) rows only makes the join builds heavier. A .unique() should make the prefilter cheaper without changing semantics.

♻️ Proposed change

```diff
-sr_customer_item = store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])
+sr_customer_item = store_returns_filtered.select(
+    ["sr_customer_sk", "sr_item_sk"]
+).unique()
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py` at line 121, sr_customer_item currently keeps duplicate (sr_customer_sk, sr_item_sk) rows used as an existence set for two semi-joins; make the prefilter cheaper by deduplicating it. Update the assignment for sr_customer_item (from store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])) to call .unique() (or the equivalent drop_duplicates()) on the resulting frame so duplicates are removed before using sr_customer_item in the semi-joins.

python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py (1)
126-126: ⚡ Quick win: Deduplicate the semi-join probe keys.

sr_customer_item is only used as the RHS of how="semi" joins, so duplicate (sr_customer_sk, sr_item_sk) pairs cannot change results but can still bloat both downstream join builds. Adding .unique() here should reduce the work in the hottest part of this rewrite.

♻️ Proposed change

```diff
-sr_customer_item = store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])
+sr_customer_item = store_returns_filtered.select(
+    ["sr_customer_sk", "sr_item_sk"]
+).unique()
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py` at line 126, sr_customer_item currently selects ["sr_customer_sk", "sr_item_sk"] from store_returns_filtered but is only used as the RHS of semi-joins (how="semi"), so duplicate (sr_customer_sk, sr_item_sk) pairs only bloat downstream joins; change the creation to deduplicate the probe keys by applying .unique() to the selection (i.e., replace the current store_returns_filtered.select([...]) usage for sr_customer_item with store_returns_filtered.select([...]).unique()) so the semi-join builds operate on distinct keys.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 800a9d03-24c5-465a-8998-3ee5af1eaa20
📒 Files selected for processing (19)
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
🚧 Files skipped from review as they are similar to previous changes (17)
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
- python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
/ok to test f1a800d
Description
Hand-tune polars_impl for 19 TPC-DS benchmark queries in python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/. Each rewrite preserves query semantics and only changes how the polars LazyFrame is constructed; duckdb_impl is unchanged.

The optimizations apply a small set of recurring patterns that the polars optimizer does not (yet) perform automatically:
- Filter date_dim, item, store, etc. by literal predicates (year, quarter, month window, category/class/brand) before any join, so the join builds smaller hash tables.
- Use derived pair sets (e.g. store_returns (customer, item) pairs) as semi-join probes against the fact tables, shrinking them before the expensive joins.
- select(...) only the columns each table contributes before joining, instead of relying on the planner to prune them later.

Test plan
Checklist